IN THE CLAIMS - Following is the list of claims and their status:

1. (Currently Amended) A computer-assisted method for identifying 
duplicate and near-duplicate documents in a large collection of documents, comprising the steps 
of: 

initially, selecting distinctive features contained in the collection of documents,

then, for each document, identifying the distinctive features contained in the
document, and

then, for each pair of documents having at least one distinctive feature in
common, comparing the distinctive features of the documents to determine whether the
documents are duplicate or near-duplicate documents,

wherein the distinctive features are text fragments, which are sequences of at least
two words that appear in a limited number of documents in the document collection,

wherein the text fragments are determined to be distinctive features based upon
a function of the frequency of a text fragment within a document in the large collection of
documents.
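As an editorial illustration only (not part of the claim language), the feature-selection step of claim 1 might be sketched in Python as follows. The function names, the whitespace tokenization, the `max_n` cap on fragment length, and the `min_docs`/`max_docs` bounds are all assumptions introduced for this sketch; the claim itself requires only word sequences of at least two words that appear in a limited number of documents.

```python
from collections import defaultdict


def word_ngrams(text, min_n=2, max_n=3):
    """Yield every run of min_n..max_n consecutive words as a tuple.

    Lowercasing and whitespace splitting are simplifying assumptions.
    """
    words = text.lower().split()
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def select_distinctive_fragments(docs, min_docs=2, max_docs=2, max_n=3):
    """Map each fragment to the set of documents containing it.

    Only fragments found in between min_docs and max_docs documents are
    kept; max_docs plays the role of the "limited number" (hypothetical
    parameter names).
    """
    doc_ids_by_fragment = defaultdict(set)
    for doc_id, text in docs.items():
        for fragment in word_ngrams(text, 2, max_n):
            doc_ids_by_fragment[fragment].add(doc_id)
    return {
        fragment: ids
        for fragment, ids in doc_ids_by_fragment.items()
        if min_docs <= len(ids) <= max_docs
    }
```

A fragment that occurs in only one document cannot link a pair of documents, which is why this sketch defaults `min_docs` to 2; that default is a design choice of the sketch, not a limitation recited in claim 1.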

2. (Original) The computer-assisted method according to claim 1, wherein 
the method is applied to removing duplicates in document collections. 

3. (Original) The computer-assisted method according to claim 1, 
wherein the method is applied to detecting plagiarism.

4. (Original) The computer-assisted method according to claim 1,
wherein the method is applied to detecting copyright infringement. 

5. (Original) The computer-assisted method according to claim 1, 
wherein the method is applied to determine the authorship of a document. 

6. (Original) The computer-assisted method according to claim 1, 
wherein the method is applied to clustering successive versions of a document from among a 
collection of documents. 

7. (Original) The computer-assisted method according to claim 1, 
wherein the method is applied to seeding a text classification or text clustering algorithm with 
sets of duplicate or near-duplicate documents. 

8. (Original) The computer-assisted method according to claim 1, 
wherein the method is applied to matching an e-mail message with responses to the e-mail 
message. 

9. (Original) The computer-assisted method according to claim 1, 
wherein the method is applied to matching responses to an e-mail message with the e-mail 
message.

10. (Original) The computer-assisted method according to claim 1,
wherein the method is applied to creating a document index for use with a query system to
efficiently find documents in response to a query which contain a particular phrase or excerpt.

11. (Original) The computer-assisted method according to claim 10,
wherein the document index can be utilized even if the particular phrase or excerpt was not
recorded correctly in the document or in the query.

12. (Original) The computer-assisted method according to claim 1,
wherein the distinctive features appear in a different order in each of the documents.

13. (Cancelled)

14. (Currently Amended) The computer-assisted method according to claim
1, wherein the method is applied to information retrieval methods.

15. (Previously Presented) The computer-assisted method according to claim
14, wherein a text classification method is applied to the information retrieval method.

16. (Original) The computer-assisted method according to claim 14, 

wherein: 

the information retrieval method assumes word independence, and 
the distinctive text fragments are added to an index set.

17. (Cancelled) 

18. (Currently Amended) The computer-assisted method according to claim
1, wherein if one distinctive text fragment is contained within another distinctive text
fragment within the same document, only the longest distinctive text fragment is considered as 
a distinctive feature. 
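As an editorial illustration only, the containment rule of claim 18 might be sketched as below. Fragments are modeled as tuples of words, and "contained within" is interpreted as contiguous containment of one word sequence inside a longer one; both the representation and the function names are assumptions of this sketch, which also ignores where in the document each fragment occurs.

```python
def is_contained(short, long):
    """Return True if word tuple `short` occurs contiguously inside the
    strictly longer word tuple `long`."""
    n = len(short)
    if n >= len(long):
        return False
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def keep_longest(fragments):
    """Given one document's distinctive fragments, drop every fragment
    that is contained in a longer fragment of the same document."""
    frags = set(fragments)
    return {f for f in frags if not any(is_contained(f, g) for g in frags)}
```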

19. (Cancelled) 

20. (Currently Amended) The computer-assisted method according to claim
wherein the sequences of at least two words are considered as appearing in a document
when the document contains the sequence of at least two words at least a user-specified
minimum frequency.

21. (Currently Amended) The computer-assisted method according to claim
1, wherein:

for each sequence of at least two words, a distinctiveness score is calculated, and 
the highest scoring sequences that are found in at least two documents in the 
document collection are considered distinctive text fragments.

22. (Currently Amended) The computer-assisted method according to claim
21, wherein the distinctiveness score is the reciprocal of the number of documents containing the
text fragment multiplied by a monotonic function of the number of words in the
text fragment.

23. (Currently Amended) The computer-assisted method according to claim
22, wherein the monotonic function is the number of words in the text fragment.

24. (Currently Amended) The computer-assisted method according to claim
21, wherein the distinctiveness score is the percentage of documents not containing the phrase
multiplied by a monotonic function of the number of words in the text fragment.

25. (Currently Amended) The computer-assisted method according to claim
24, wherein the monotonic function is the number of words in the text fragment.
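As an editorial illustration only, the two distinctiveness scores recited in claims 22 through 25 might be computed as sketched below. The function and parameter names are hypothetical; the default `length_fn` is the identity, which corresponds to the case in claims 23 and 25 where the monotonic function is simply the number of words.

```python
def reciprocal_score(doc_freq, num_words, length_fn=lambda n: n):
    """Claims 22-23 style score: (1 / number of documents containing the
    fragment) multiplied by a monotonic function of its word count."""
    return (1.0 / doc_freq) * length_fn(num_words)


def absence_score(doc_freq, total_docs, num_words, length_fn=lambda n: n):
    """Claims 24-25 style score: percentage of documents NOT containing
    the fragment, multiplied by a monotonic function of its word count."""
    pct_absent = 100.0 * (total_docs - doc_freq) / total_docs
    return pct_absent * length_fn(num_words)
```

Both scores grow as a fragment becomes rarer and longer, so ranking by either one favors long fragments shared by few documents.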

26. (Currently Amended) The computer-assisted method according to claim
1, wherein the limited number is selected by a user.

27. (Currently Amended) The computer-assisted method according to claim
1, wherein the limited number is defined by a linear function of the number of documents in
the document collection.

28. (Currently Amended) The computer-assisted method according to claim
1, wherein the distinctive text fragments include glue words.

29. (Original) The computer-assisted method according to claim 28, 
wherein the glue words do not appear at either extreme of the distinctive text fragments. 

30. (Original) The computer-assisted method according to claim 1, 
further including the step of for each pair of documents having at least one distinctive feature in 
common, counting the number of distinctive features in common, 

wherein determining whether the pair of documents is duplicates or near- 
duplicates includes the steps of: 

for each pair of documents, calculating an overlap ratio by dividing the number 
of distinctive features in common by the smaller of the number of distinctive features per 
document, and 

comparing the overlap ratio to a threshold and if the overlap ratio is greater than 
the threshold, then the pair of documents are duplicates or near-duplicates, otherwise the pair of 
documents is not duplicates or near-duplicates. 
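As an editorial illustration only, the overlap-ratio test of claim 30 might be sketched as follows. The function names and the default threshold value are assumptions of this sketch; the claim does not specify a particular threshold.

```python
def overlap_ratio(features_a, features_b):
    """Number of distinctive features in common, divided by the smaller
    of the two documents' distinctive-feature counts."""
    a, b = set(features_a), set(features_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # a document with no features cannot match anything
    return len(a & b) / smaller


def is_near_duplicate(features_a, features_b, threshold=0.5):
    """Duplicate or near-duplicate when the ratio exceeds the threshold
    (0.5 is an arbitrary illustrative value)."""
    return overlap_ratio(features_a, features_b) > threshold
```

Dividing by the smaller feature count means a short document fully embedded in a longer one still scores 1.0.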

31. (Original) The computer-assisted method according to claim 30, 
further including the steps of: 

building a document index that maps each document to its associated distinctive 
features, wherein if one distinctive feature is repeated within one document, the index maps the 
document to the distinctive feature once, and

building a feature index that maps each distinctive feature to its associated
document, wherein if one distinctive feature is repeated within one document, the index maps 
the distinctive feature to the document once, 

wherein determining whether the pair of documents are duplicates or near-duplicates
further includes the steps of:

creating a list of unique distinctive features from the document index,

for each unique distinctive feature, creating a list of documents which contain the
unique distinctive feature, and

for each document, creating a list of documents that have at least one feature in
common with the document and the number of features in common with the document.

32. (Original) The computer-assisted method according to claim 31,
wherein the distinctive features include distinctive phrases.

33. (Original) The computer-assisted method according to claim 31,
wherein the distinctive features appear in a different order in each of the documents.

34. (Original) The computer-assisted method according to claim 31,
wherein the distinctive features include text spans.

35. (Original) The computer-assisted method according to claim 34,
wherein the text spans include sentences.

36. (Original) The computer-assisted method according to claim 34,
wherein the text spans include lines of text.
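As an editorial illustration only, the indexing and counting steps recited in claim 31 might be sketched as below. The function names and the use of Python sets and dictionaries are assumptions of this sketch; sets are used so that a feature repeated within one document is mapped only once, as the claim requires.

```python
from collections import defaultdict


def build_indexes(doc_features):
    """Build the document index (document -> features) and the feature
    index (feature -> documents). Repeated features collapse to one entry."""
    document_index = {doc: set(feats) for doc, feats in doc_features.items()}
    feature_index = defaultdict(set)
    for doc, feats in document_index.items():
        for feat in feats:
            feature_index[feat].add(doc)
    return document_index, dict(feature_index)


def shared_feature_counts(feature_index):
    """For each document, list the other documents sharing at least one
    feature with it, together with the number of shared features."""
    counts = defaultdict(lambda: defaultdict(int))
    for docs in feature_index.values():
        for doc in docs:
            for other in docs:
                if other != doc:
                    counts[doc][other] += 1
    return {doc: dict(others) for doc, others in counts.items()}
```

Walking the feature index rather than comparing every pair of documents means only pairs with at least one feature in common are ever touched.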

37. (Currently Amended) A computer-assisted method for identifying
duplicate and near-duplicate text spans in a large collection of text spans, comprising the steps
of:

initially, selecting distinctive features contained in the collection of text spans,

then, for each text span, identifying the distinctive features contained in the text
span, and

then, for each pair of text spans having at least one distinctive feature in common,
comparing the distinctive features of the text spans to determine whether the text spans are
duplicate or near-duplicate text spans,

wherein the distinctive features are text fragments, which are sequences of at least
two words that appear in a limited number of text spans in the large collection of text spans,

wherein the text fragments are determined to be distinctive features based upon
a function of the frequency of a text span in the large collection of text spans.

38. (Original) The computer-assisted method according to claim 37, 
wherein the text spans are sentences. 

39. (Currently Amended) An apparatus to enable a method for identifying 
duplicate and near-duplicate documents in a large collection of documents, comprising:

a means for initially selecting distinctive features contained in the collection of
documents;

a means for subsequently identifying the distinctive features contained in each
document; and

a means for then comparing the distinctive features of each pair of documents
having at least one distinctive feature in common to determine whether the documents are
duplicate or near-duplicate documents,

wherein the distinctive features are text fragments, which are sequences of at least
two words that appear in a limited number of documents in the document collection,

wherein the text fragments are determined to be distinctive features based upon 
a function of the frequency of a text fragment within a document in the large collection of 
documents. 